Detailed protocol


Command Line Processing of Sequencing Reads 

The following list describes the key steps for command line processing of C-RNA sequencing data. Table 1 specifies the exact software used for each 
processing step. Multiple software packages and versions exist for many of these functions, and these will typically produce comparable results. However, it is beyond the scope 
of this document to delineate all possible options or their corresponding caveats. 


1. Demultiplex: Generate fastq files for each sample from raw sequencing data. 
   a. If a given library is sequenced on multiple sequencing runs, the fastq files must be concatenated before proceeding. 
2. Downsample: Randomly subsample a specified number of fastq reads. 
   a. A maximum of 50M reads was used for this publication. For this protocol, we recommend processing at least 20M reads per sample. Higher sequencing depths 
      can reduce technical noise, though there appear to be diminishing returns with >100M reads per sample. 
3. Filter Abundant Sequences: Remove sequencing reads from common contaminants. 
   a. Important: Save the remaining (non-abundant) reads in fastq format with the "--un" option. These files are the input in step 4. 
   b. Use a reference index containing undesired sequences such as Illumina adapters, chrM, human ribosomal and 5S DNA, phage phiX174, polyA 
      and polyC. The appropriate composition of this reference may depend on experimental details. 
4. Map to Reference Genome: Identify where in the genome (and/or transcriptome) each sequencing read originates. 
   a. Bowtie2 (1) must be installed; we used v2.2.3. 
5. Sort Mapped Reads: Lexicographical sorting of sequencing reads by mapping coordinates. 
6. Index Mapped Reads: Generate an index file used in step 7. 
7. Count Reads per Transcript: 
   a. Ensure that the transcriptome reference file is in GTF format and comes from the same genome assembly used for mapping in step 4. 
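
As an illustration of steps 1a and 2, the sketch below reads fastq records from one or more files in order (equivalent to concatenating runs of the same library) and keeps a fixed number of records chosen uniformly at random by reservoir sampling. It is a minimal Python sketch, not the Seqtk implementation used for this protocol; the function names, the default seed, and the assumption of well-formed four-line records are all illustrative.

```python
import gzip
import random
from typing import Iterator, List, Tuple

# One fastq record: header, sequence, separator, quality (four lines).
Record = Tuple[str, str, str, str]


def open_text(path: str):
    """Open a plain or gzip-compressed file for reading as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_fastq(paths: List[str]) -> Iterator[Record]:
    """Yield fastq records from several files, one file after another.

    Reading the files in sequence has the same effect as concatenating
    the fastq files from multiple sequencing runs of one library.
    Assumes every record occupies exactly four lines.
    """
    for path in paths:
        with open_text(path) as handle:
            while True:
                header = handle.readline()
                if not header:  # end of this file
                    break
                seq = handle.readline()
                sep = handle.readline()
                qual = handle.readline()
                yield (header.rstrip("\n"), seq.rstrip("\n"),
                       sep.rstrip("\n"), qual.rstrip("\n"))


def downsample(records: Iterator[Record], n: int, seed: int = 0) -> List[Record]:
    """Keep at most n records, chosen uniformly at random.

    Reservoir sampling: the first n records fill the reservoir, and each
    later record replaces a random slot with probability n / (i + 1).
    """
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = record
    return reservoir
```

Because the random draws in this sketch depend only on the record index and the seed, running it with the same seed on two mate files that hold the same number of records selects matching pairs. The reservoir is held in memory, so sampling tens of millions of reads this way would need a correspondingly large amount of RAM.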


Table 1. Command line software used for data processing. 


Processing  Software       Sub-Function*   Version           Recommended Command
Step                                                         Line Options*

1           bcl2fastq2     -               v2.20.0.422       --barcode-mismatches 0
                                                             --ignore-missing-bcls
                                                             --ignore-missing-filter
                                                             --ignore-missing-positions
2           Seqtk (2)      sample          v1.2-r102-dirty   -2
3           Bowtie (3)     -               v1.0.0            -k 1
                                                             -n 0
                                                             -l 25
                                                             --mapq 10
4           TopHat2 (4)    -               v2.0.13           --no-coverage-search
                                                             --no-novel-indels
                                                             --b2-fast
5           Picard (5)     ReorderSam      v1.93             -
6           Samtools (6)   index           v0.1.19           -
7           Subread (7)    featureCounts   v1.4.6            -Q 10
                                                             -p
                                                             -P

* - indicates the category is not applicable for the given processing step 


Dataset Quality Control (QC) Assessments 

QC checks are crucial for confidence in the quality of datasets, and therefore in the conclusions drawn from them. However, which metrics are most informative and 
what thresholds to set likely depend on a number of experimental variables, including sample type and quantity, C-RNA extraction, enrichment, and sequencing library 
preparation protocols. 

While recommendations for universal QC requirements are not yet possible, we strongly advise any users to examine the quality of all datasets prior to downstream 
analyses. Commonly useful tools and data checks are listed below. 


• The percent of reads excluded in processing step 3 (filtering abundant sequences). If this value is large, the specific sequence(s) represented may help troubleshoot assay 
  performance issues. 
• Mapping rates (processing step 4). Samples with abnormally low values warrant further examination. 
• Software packages 
   o FastQC (8): provides a variety of useful measurements about sequencing run quality. 
   o Preseq (9): estimates library complexity; abnormally high or low values warrant further investigation. 
   o RSeQC (10): provides a variety of useful measurements about RNA-Seq data quality. Gene body coverage and transcript abundance saturation are particularly 
     informative. 
   o BLAST (11): running BLAST on a selection of unmapped reads can confirm or rule out unexpected contaminations. 
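
The first two checks above reduce to simple arithmetic on per-sample read counts. The Python sketch below computes the percent of reads removed in step 3 and the mapping rate from step 4, then flags samples whose value falls well below the rest of the cohort. The flagging rule (more than three median absolute deviations below the median) is an assumption chosen for illustration, not a threshold recommended by this protocol, and all function names are hypothetical.

```python
from statistics import median
from typing import Dict, List


def percent_filtered(reads_in: int, reads_kept: int) -> float:
    """Percent of reads removed by abundant-sequence filtering (step 3)."""
    if reads_in <= 0:
        raise ValueError("reads_in must be positive")
    return 100.0 * (reads_in - reads_kept) / reads_in


def mapping_rate(reads_mapped: int, reads_input: int) -> float:
    """Percent of input reads that mapped to the reference (step 4)."""
    if reads_input <= 0:
        raise ValueError("reads_input must be positive")
    return 100.0 * reads_mapped / reads_input


def flag_low(values: Dict[str, float], n_mads: float = 3.0) -> List[str]:
    """Return names of samples far below the cohort median.

    Assumed rule: a sample is flagged when its value is more than
    n_mads median absolute deviations (MADs) below the median.
    If the MAD is zero, any value below the median is flagged.
    """
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    return sorted(name for name, v in values.items()
                  if v < center - n_mads * mad)
```

For example, given mapping rates of 90, 91, 89, 92 and 40 percent for five hypothetical samples, the median is 90 and the MAD is 1, so only the sample at 40 percent is flagged for further examination.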


Group Comparison Analyses 

Much more characterization is needed from a wide range of applications, populations, and preparations before universal recommendations can be made for how best to 
generate biological interpretations from C-RNA sequencing data. The approaches used in the provided code may or may not be optimal for a different dataset. 

We developed custom analyses to address a specific challenge: high biological variability. Even with relatively large sample sizes, we observed inconsistent results when 
using standard tools. We theorize that the heterogeneity of the disease preeclampsia and the diverse and numerous sources of C-RNA manifest as abnormally prevalent outlier 
signals as well as smaller fold-change differences and lower signal-to-noise ratios than standard, single-tissue RNA-Seq. 

The attached CRNA-DEX-ANALYSIS-sharing.R.txt R script contains the code used to run differential expression analysis (with and without jackknifing) on transcript 
read counts obtained from bioinformatic processing. The attached CRNA-ADABOOST-ANALYSIS-sharing.py.txt Python script contains the code used to optimize 
hyperparameters and fit AdaBoost models to the iPEC C-RNA data. As the AdaBoost implementation is not deterministic, users should expect the models generated to be 
similar, but not necessarily identical, if run multiple times. Input file formatting requirements are described in each script. 
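
To illustrate why jackknifing helps when outlier samples are prevalent, the Python sketch below applies a generic leave-one-out jackknife to a single transcript. It is not the code in the attached R script, and the resampling scheme used there may differ. The statistic (log2 ratio of group means with a pseudocount), the fold-change cutoff, and all function names are assumptions made for this example.

```python
import math
from typing import List, Sequence


def log2_fold_change(case: Sequence[float], control: Sequence[float],
                     pseudocount: float = 1.0) -> float:
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    mean_case = sum(case) / len(case)
    mean_control = sum(control) / len(control)
    return math.log2((mean_case + pseudocount) / (mean_control + pseudocount))


def jackknife_fold_changes(case: Sequence[float],
                           control: Sequence[float]) -> List[float]:
    """Recompute the fold change with each single sample left out in turn."""
    case, control = list(case), list(control)
    results = []
    for i in range(len(case)):
        results.append(log2_fold_change(case[:i] + case[i + 1:], control))
    for i in range(len(control)):
        results.append(log2_fold_change(case, control[:i] + control[i + 1:]))
    return results


def jackknife_support(case: Sequence[float], control: Sequence[float],
                      min_abs_lfc: float = 1.0) -> float:
    """Fraction of leave-one-out subsets where |log2FC| reaches the cutoff."""
    lfcs = jackknife_fold_changes(case, control)
    return sum(abs(x) >= min_abs_lfc for x in lfcs) / len(lfcs)
```

With hypothetical counts of 7, 7, 7 and 100 in one group and 7, 7, 7 and 7 in the other, the apparent difference disappears when the single high sample is left out, so the support is 7 of 8 subsets rather than all of them. A transcript whose difference is shared across samples keeps full support under the same procedure.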